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A note on the 0(n)-storage implementation of 
the GKO algorithm and its adaptation to 
Trummer-like matrices 



Federico Poloni* 



Abstract 

We propose a new 0(n)-space implementation of the GKO-Cauchy 
algorithm for the solution of linear systems where the coefficient matrix 
is Cauchy-like. Moreover, this new algorithm makes a more efficient use 
of the processor cache memory; for matrices of size larger than n ~ 500 — 
1000, it outperforms the customary GKO algorithm. 

We present an applicative case of Cauchy-like matrices with non- 
reconstructible main diagonal. In this special instance, the 0(n) space 
algorithms can be adapted nicely to provide an efficient implementation 
of basic linear algebra operations in terms of the low displacement-rank 
generators. 
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1 Introduction 

Several classes of algorithms for the numerical solution of Toeplitz-likc linear 
systems exist in the literature. We refer the reader to |2U] for an extended in- 
troduction on this topic, with descriptions of each method and plenty of citations 
to the relevant papers, and only summarize them in the following table. 
Name Operations Memory Stability 



Lcvinson 



Schur-Bareiss 

GKO 
Supcrfast 



0(t?) 
0(n 2 ) 



0{n 2 ) 
0(n log 2 n) 



0{n) 

0{n 2 ) 

0{n 2 ) 
0{n) 



stable only for some symmetric 
matrices 

backward stable only for sym- 
metric, positive definite matrices 
stable in practice in most cases 
leading constant may be large; 
may be unstable in the nonsym- 
metric case 

The Levinson algorithm is known to be unstable even for large classes of sym- 
metric positive definite matrices [3]; stabilization techniques such as look-ahead 
may raise the computational cost from 0(n 2 ) to 0(n 3 ) or from 0(nlog 2 n) to 
0(n 2 ). A mixed approach like the one in the classical FORTRAN code by Chan 
and Hansen [11] bounds the complexity growth, but may fail to remove the in- 
stability. The GKO algorithm is generally stabler [5] , even though in limit cases 
the growth of the coefficients appearing in the Cauchy-like generators may lead 
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to instability. Though superfast Toeplitz solvers have a lower computational 
cost, when the system matrix is nonsymmetric and ill-conditioned Oln 2 ) algo- 
rithms such as the GKO algorithm [5] are still attractive. 

In this paper we will deal with the GKO algorithm. It is composed of two 
steps: reduction of the Toeplitz matrix to a Cauchy-like matrix with displace- 
ment rank r = 2, which takes 0(n) memory locations, and 0(n log n) ops, and 
actual solution of the Cauchy-like system via a generalized Schur algorithm, 
which takes 0(n 2 ) ops and 0(n 2 ) auxiliary memory locations. 

In 1994, Kailath and Chun [H] showed that it is possible to express the 
solution of a linear system with Cauchy-like matrix as the Schur complement 
of a certain structured augmented matrix. In 2006, in a paper on Cauchy-like 
least squares problems, G. Rodriguez exploited this idea to design a variation 
of GKO using only 0(n) memory locations. 

In the first part of the present paper, we will provide an alternative 0(n)- 
space implementation of the Schur Cauchy-like system solution algorithm, hav- 
ing several desirable computational properties. 

Moreover, in some applications, a special kind of partially reconstructiblc 
Cauchy-like matrices appear, i.e., those in which the main diagonal is not re- 
constructiblc. We shall call them Trummer-like, as they are associated with 
Trummer's problem [5]. We will show how the 0(n)-storage algorithms adapt 
nicely to this case, allowing one to develop an integrated algorithm for their fast 
inversion. In particular, one of the key steps in order to obtain a full represen- 
tation of their inverse is the calculation of diag(T _1 ) for a given Trummer-like 
T. 

Structure of the paper In Iscction 2l we will recall the concept of displace- 
ment operators, Cauchy-like and Trummer-like matrices. In sections [3] and 0] 
we will study respectively the original GKO algorithm and its first 0(n)-space 
variant due to Rodriguez pQ , |18| . In Iscction 5l we will introduce and analyze 
our new 0(n)-space variant. In Iscction 61 we will deal with system solving and 
matrix inversion for Trummer-like matrices. Finally, Iscction 7l is dedicated to 
showing some numerical experiments that confirm the effectiveness of our ap- 
proach, and Iscction 81 contains some conclusive remarks. 

2 Basic definitions 

Indexing and notation We will make use of some handy matrix notations 
taken from FORTRAN and Matlab®. When M is a matrix, the symbol M i: j tk -j 
denotes the submatrix formed by rows i to j and columns k to I of M, including 
extremes. The index i is a shorthand for i : i, and : alone is a shorthand for 
1 : n, where n is the maximum index allowed for that row/column. A similar 
notation is used for vectors. When v G C" is a vector, the symbol diag(w) 
denotes the diagonal matrix D G C nx ™ such that Z?^ = Uj. On the other hand, 
when M G C" x ™ is a square matrix, diag(M) denotes the (column) vector with 
entries Mi,i, M 2 , 2 , ■ ■ ■ , M nj „. 

Throughout the paper, we shall say that a vector s G C" is injective if 
Si =/= Sj for all i,j = 1,2, ... ,n such that i ^ j. In the numerical experiments, 
we will denote by ||-|| the Euclidean 2-norm for vectors and the Frobenius norm 
for matrices. 
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Displacement operators and Cauchy-like matrices Let t, s € C". We 

shall denote by Vt jS the operator C nx ™ — > C" xn which maps M to 

V t , s (M) = diag(t)M - Mdiag(s). 

A matrix G S C" XTl is said Cauchy-like (with displacement rank r) if there are 
vectors s,t and matrices G E C nxr ,B G C rxrl such that 

V S)t (G)=G5. (1) 

Notice that if we allow r = n, then any matrix is Cauchy-like. In the applica- 
tions, we are usually interested in cases in which r <C n, since the computational 
cost of all the involved algorithms depends on r. 

A Cauchy-like matrix is called a quasi- Cauchy matrix if r = 1, and Cauchy 
matrix if G* = B = (1,1,..., 1). Usually, it is assumed that the operator Vt jS 
is nonsingular, or equivalently, ij ^ Sj for all pairs Under this assumption, 
the elements of G can be written explicitly as 

Cij = (2) 

thus C can be fully recovered from G, B, t and s. Otherwise, the latter formula 
only holds for the entries Cij such that U ^ Sj, and G is said to be partially 
reconstructible. The matrices G and B are called the generators of G, and the 
elements of t and s are called nodes. The vectors t and s are called node vectors, 
or displacement vectors. 



TVummer-like matrices In Iscction 6l we will deal with the case in which 
t = s is injective, that is, when the non-reconstructiblc elements are exactly 
the ones belonging to the main diagonal. We will use V s as a shorthand for 
V S)S - If V S (T) = GB has rank r, a matrix T will be called Trummer-like 
(with displacement rank r). Notice that a Trummcr-likc matrix can be fully 
recovered from G, B, s, and d = diag(T). Trummer-like matrices are related to 
interpolation problems [S], and may arise from the transformation of Toeplitz 
and similar displacement structure |15| . or directly from the discretization of 
differential problems [3]. 



3 Overview of the GKO Schur step 

Derivation The fast LU factorization of a Cauchy-like matrix G is based on 
the following lemma. 

Lemma 1. \16f Let 

(-i _ Gl ; l C\_2:n 
i ■ / i 

satisfy the displacement equation ((T|), and suppose C\ t \ nonsingular. Then its 
Schur complement C^ = G2 : n,2:n~G2 :r i,iGi.i~ 1 Gi ! 2:n satisfies the displacement 
equation 

diag(t 2: „)G( 2 > - G< 2 > diag( S2: „) = G^B^\ 

with 

G^ — G2:n,l:r ~ G2 :n ,lGi,i 1 Gi,i :r , B^ = Bi :r 2- n — -Bl:r,lGi.i 1 C\_2:n- (3) 
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Using this lemma, we can construct the LU factorization of G with 0(n 2 ) 
floating point operations (ops). The algorithm goes on as follows. Given G^> = 
G, B^ = B, and the two vectors s and t, recover the pivot Ci,i, the first row 
Ci,2:n and the first column C^-.n,! of C using the formula @. This allows to 
calculate easily the first row of U as [Ci,i Ci,2:n] and the first column of L 

as [l Gj„ iCi,! -1 ] T . Then use equations ([3]) to obtain the generators G' 2 ' 
and B^ 2 ' of the Schur complement of C. Repeat the algorithm setting 
G <— G' 2 ', B <— B^ 2 \ s <— S2-.n and t <— <2:n to get the second row of U and 
the second column of L, and so on. A simple implementation is outlined in 
| Algorithm 1| Note that for the sake of clarity we used two different variables L 
and U\ in fact, it is a widely used technique to have them share the same n x n 
array, since only the upper triangular part of the matrix is actually used in U, 
and only the strictly lower triangular part in L. When the LU factorization is 



Algorithm 1 LU factorization of Cauchy-like matrices [9] 

Require: G G C" xr , B G C rx ™, t,sG C n {generators of the matrix} 
{temporary variables: L,U G G" x " (can share the same storage space)} 

L «- /„, U <- O n 

for k = 1 to n — 1 do 

U k ,i <- G t; B s f for all £ = k to n 

^ ^fcfef fOT alH = fc + 1 to n 
Gf,: <— Gf. : — L^ : kGk.-. for all £ = k + 1 to n 
B : 'i <- B :/ - U^lB : ' k U k> e for all £ = fc + 1 to n 
end for 



7 (-ri — a 

return L, U 



i X in 



only used for the solution of a linear system in the form Cx = b, with b G C n > 
it is a common technique to avoid constructing explicitly L, computing instead 
L~ 1 b on-thc-fly as the successive columns of L are computed. This is also 



possible with the GKO algorithm, as shown in Algorithm 2 



Comments Notice that [Algorithm 2 includes partial pivoting. Its total cost 



is (4r + 2m + l)n 2 + o(n 2 ) ops, when applied to a matrix G with displacement 
rank r and an n x m right-hand side. The algorithm works whenever G is a 
completely reconstructible Cauchy matrix; if it is not the case, when the number 
of non-reconstructible entries is small, the algorithm can be modified to store 
and update them separately, see e.g. Kailath and Olshevsky [15] or lscction~6l 

However, there is an important drawback in [Algorithm 2[ while the size of 
the input and output data is 0{n) (for small values of m and r), 0(n 2 ) mem- 
ory locations of temporary storage are needed along the algorithm to store U . 
Therefore, for large values of n the algorithm cannot be effectively implemented 
on a computer because it does not fit in the RAM. 

Moreover, another important issue is caching. Roughly speaking, a personal 
computer has about 512 kb-8 Mb of cache memory, where the most recently ac- 
cessed locations of RAM are copied. Accessing a non-cached memory location is 
an order of magnitude slower than a cached one. The real behavior of a modern 
processor is more complicated than this simple model, due to the presence of 
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Algorithm 2 Solving a system Cx = b with implicit L factor and pivoting [9] 

Require: G G C" xr , B G C rx ™, i,s€ C n {generators of the matrix} 
Require: b G C" Xm {right-hand side} 
{temporary variables: / G C™, C/ G C nx ™} 

a; <— b 

for fc = 1 to n — 1 do 

h <- G /' :B: ' fc for all £ = fc to n 

g <— argmax^_ fc n \li\ {Finds pivot position} 

P*~lq {pivot} 

if p=0 then 

print 'error: singular matrix' 
end if 

swap Ik and l q ; Xk,; and x q ^ Gk,-. and G q . : \ tk and t q 
U k ,k <- P 

f/fc ^ <- G ^ B r l for aU ^ = fc + 1 to n 

a;f. : — h(jp Xk : ) for all £ = fc + 1 to n 
G\. <- G £ . : - Z<(p -1 Ga, : ) for alH = A; + 1 to n 
£? : ^ <— _B : ^ — p~ 1 B; t kUkj for all £ = fc + 1 to n 
end for 

JJ ^ G„,:B:,„ 

tn~«n 

%n,-. *- Xn,-./U n ^ n {start of the back-substitution step} 
for fc = n — 1 down to 1 do 

%k,\ Xk,: ~ Uk,£Xe_ : for all £ = fc + 1 to n 

Xk,: <— Xk, : /Uk,k 

end for 
return x 



several different levels of cache, each with its own performance, and instruction 
pipelines [13]. Nevertheless, this should highlight that when the used data do 
not fit anymore into the cache, saving on memory could yield a greater speedup 
than saving on floating point operations. 

4 Low-storage version of GKO: the extended 
matrix approach 

Derivation The following algorithm to solve the high storage issue in GKO 
was proposed by Rodriguez [TS] in 2006, while dealing with least squares Cauchy- 
like problems. More recently, a deeper analysis and a ready-to-use Matlab 
implementation were provided by Arico and Rodriguez [I] . 

The approach is based on an idea that first appeared in Kailath and Chun 
[H]. Let us suppose that C is a completely reconstructible Cauchy-like matrix 
and that s is injectivc. The solution of the linear system Cx = b can be expressed 
as the Schur complement of C in the rectangular matrix 

C b 
-I ' 



c = 
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Thus, we can compute x by doing n steps of Gaussian elimination on C. More- 
over, the first block column of C is a partially reconstructible Cauchy-like matrix 
with respect to ~s = [t T s T ] and t; therefore, while performing the Gaussian 
elimination algorithm, the entries of this block can be stored and updated in 
terms of the generators, as in [Algorithm 1| Unlike the previous algorithms, we 
may discard the rows of U and columns of L as soon as they are computed, 
keeping only the generators. Instead, the entries in the second block column 
are computed with customary Gaussian elimination and stored along all the 
algorithm. 

The following observations, which will be needed later, should make clearer 
what is going on with this approach. 

Lemma 2. Suppose for simplicity that no pivoting is performed; let L and U 
be the LU factors of C, x be the solution to the linear system Cx = b, y be 
the solution to Ly = b, and W = U^ 1 . Let k denote the step of Gaussian 
elimination being performed, with e.g. k = 1 being the step that zeroes out all 
the elements of the first column but the first. During the algorithm, 

1. The entry of the (1,1) block is updated at all steps k with k < 
min(i, j + 1). After its last update, it contains Uij. 

2. The entry of the (1,2) block is updated at all steps k with k < i. 
After its last update, it contains yij. 

3. The (i,j) entry of the (2,2) block is updated at all steps k with k > i. 
In particular, the last step (k = n) updates all entries, and after that the 
(2,2) block contains Xij. 

4- The (i,j) entry of the (2, 1) block is updated at all steps k with i < k < j. 
After its last update, it contains 0. Immediately before that, i.e., just after 
step j — 1, it contains —WijUjj. 

Proof. From the structure of Gaussian elimination, it can easily be verified that 
the entries are only updated during the abovementioned steps. In particular, 
for the condition on updates to the (2,1) block, it is essential that the initial 
(2, 1) block initially contains a diagonal matrix. Regarding which values appear 
finally in each position, 

1. is obvious: in fact, if we ignore all the other blocks, we are doing Gaussian 
elimination on C. 

2. is easily proved: since the row operations we perform transform C = LU 
to U, they must be equivalent to left multiplication by L _1 . 

3. is a consequence of the well-known fact that after n steps of Gaussian elim- 
ination we get the Schur complement of the initial matrix in the trailing 
diagonal block. 

4. is less obvious. Let us call Z^k the value of the (i, k) entry of the (2, 1) 
block right after step k — 1, and consider how the entries of the (2, 2) block 
are updated along the algorithm. They are initially zero, and at the Mh 
step the one in place (i, j) is incremented by —{Zi^/Uk,k)yk,ji so its final 
value is 

Xi,j = - y^X z i,k/Uk,k)yk,j- 

k 
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Since for each choice of b (and thus of y = L x b) and Wj.k are 
unchanged, as they only depend on C, and it holds that 

x i,j = ( U ~ 1 y)ij = ^2w l!k y k ,j, 

k 



the only possibility is that Wi t k = —Zi t k/Uk,k for each i, k. 
We report here the resulting Algorithm 3 



□ 



Algorithm 3 Solving a system Cx = b with the extended matrix algorithm [T] 

Require: G G C" xr , B G C rxn , t,sG C n {generators of the matrix} 
Require: b G C" Xm {right-hand side} 

{temporary variables: I, u G C™} 

.x < — b 

for fc = 1 to n — 1 do 

x G t , : B.., k fo aU ^ = 1 to fc - 1 

/ ^ G /. :B ^ for alH = fc to n 

1 tl-s k 

q <— argmax f=fc n |Z^| {Finds pivot position} 

P i« {pivot} 
if p=0 then 

print 'error: singular matrix' 
end if 

swap Ik and Z^; and x q> .] Gk,-. and G 9 , : ; tk and i 9 

^_ G fc ,:B : ,^ f JJ g = k 1 t 

. <— ■ — ZfXfe ■ for all I ^ k 

G,,:- /' ' G ■ 

G e ,: «- G/, : - /.C/a.: for all £ ^ fc 
B : ^ <— B.^i — p^ l B., : j-ui for all I = k + 1 to n 
end for 

j G^B^k fo all f = 1 to n - 1 

G„ B. „ 



return x 



Comments It is worth mentioning that several nice properties notably sim- 
plify the implementation. 

• The partial reconstructibility of C is not an issue. If the original matrix 
C is fully reconstructible and s is injective, then the non-reconstructible 
entries of C are the ones in the form C(n + k, k) for k = 1, . . . , n, that is, 
the ones in which the —1 entries of the —I block initially lie. It is readily 
shown that whenever the computation of such entries is required, their 
value is the initial one of —1. 
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• At each step of the algorithm, the storage of only n rows of G and of the 
right block column x is required: at step k, we only need the rows with 
indices from k to n + k — 1 (as the ones below are still untouched by the 
algorithm, and the ones above are not needed anymore). It is therefore 
possible to reuse the temporary variables to store the rows modulo n, thus 
halving the storage space needed for some of the matrices. 

• Pivoting can be easily included without destroying the block structure by 
acting only on the rows belonging to the first block row of C. 

[Algorithm 3| uses (6r + 2m + |)n 2 + o(n 2 ) floating point operations, and 
it can be implemented so that the input variables G, B, t, b are overwritten 
during the algorithm, with x overwriting 6, so that it only requires 2n memory 
locations of extra storage (to keep I and u). 

As we stated above, for the algorithm to work we need the additional as- 
sumption that s is injective, i.e., Sj ^ Sj for all j. This is not restrictive when 
working with Cauchy-like matrices derived from Toeplitz matrices or from other 
displacement structured matrices; in fact, in this case the entries Si are the n 
complex nth roots of a fixed complex number, thus not only are they different, 
but their differences Si — Sj can be easily bounded from below, which is impor- 
tant to improve the stability of the algorithm. This is a common assumption 
when dealing with Cauchy matrices, since a Cauchy (or quasi-Cauchy) matrix 
is nonsingular if and only if x and y are injective. For Cauchy-like matrices 
this does not hold, but the injectivity of the two vectors is still related to the 
singularity of the matrix: for instance, we have the following result. 

Lemma 3. Let s have r + 1 repeated elements, that is, s< 1 = Si 2 = ■ ■ ■ = Sj r+1 = 
s. Then the Cauchy-like matrix ([2]) is singular. 

Proof. Consider the submatrix C" formed by the r + 1 columns of C with indices 
ii, . . . ,i r +i- It is the product of the two matrices G' <E C nxr and B' G C rxr+1 , 
with 

G 

( G ')ij = t: ^ - ■ ( B ')ij = B iSj- 

Therefore C' (and thus C) cannot have full rank. □ 

5 Low-storage version of GKO: the downdating 
approach 

Derivation In this section, we shall describe a different algorithm to solve 
a Cauchy-like system using only 0(n) locations of memory. Our plan is to 
perform the first for loop in | Algorithm 2 1 unchanged , thus getting y = L~ x b, but 
discarding the computed entries of U which would take 0{n 2 ) memory locations, 
and then to recover them via additional computations on the generators. 

For the upper triangular system Ux = y to be solved incrementally by back- 
substitution, we need the entry of the matrix U to be available one row at a 
time, starting from the last one, and after the temporary value y = L~ x b has 
been computed, that is, after the whole LU factorization has been performed. 

Let G 1 --^ (£?(*")) denote the contents of the variable G (resp. B) after step k. 
The key idea is trying to undo the transformations performed on B step by step, 
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trying to recover £ W from . Because of the way in which the generators 

are updated in Algorithms [1] and [2J the first row of G^ and the first column 
of B^ are kept in memory untouched by iterations k + 1, . . . , n of the GKO 
algorithm. Thus we can use them in trying to undo the fcth step of Gaussian 
elimination. 

Let us suppose we know B^ k+1 \ i.e., the contents of the second generator 
B after the (k + l)st step of Gaussian elimination, and the values of £?£ . and 

(k) 

B. k , which are written in G and B by the fcth step of Gaussian elimination 



and afterwards unmodified (since the subsequent steps of Algorithm 2 do not 
use those memory locations anymore) . 

We start from the second equation of and © for the kth row of U, 
written using the colon notation for indices. 

B (k+i) = B ( k ) _ B w Ukik -i Uk4t i > k} 

U k ,e = : ' e , £ > k. 

tk - Si 

(k) 

Substituting B / from the first into the second, and using the k — £ case of the 
latter to deal with Uk,k, we get 

r (k)r3(k+i) r (k)„(k) TT -l ^(fe) R (fe+i) 

Uk,e — — : ■ 1 Uk,e = — H i>k,i 

t k - Si t k - Si t k -s e t k - Si 

and thus 

^k ■ t 

U k ,i= *- •* , £>k, (4) 

Sk ~ Si 

= B u +1) + B^Uk^U^i, I > k. (5) 

The above equations allow one to recover the value of B^f for all £ > k using 

only B. 1 ^ 1 ', G^] and B) k ^ as requested. 

By applying the method just described repeatedly for k = n— 1, n — 2, . . . , 1, 
we are able to recover one at a time the contents of B^ n ^ x \ B^ n ^ 2 \ . . . , B^\ 
which were computed (and then discarded) in the first phase of the algorithm. 
I.e., at each step we "downdate" B to its previous value, reversing the GKO 
step. In the meantime, we get at each step k = n — 1, n — 2, . . . , 1 the fcth row 
of U . In this way, the entries of U are computed in a suitable way to solve the 
system Ux = y incrementally by back-substitution. 



We report here the resulting Algorithm 4 



Comments Notice that pivoting only affects the first phase of the algorithm, 
since the whole reconstruction stage can be performed on the pivoted version of 
C without additional row exchanges. 

This algorithm has the same computational cost, (6r + 2m + |)n 2 + o(n 2 ), 
and needs the same number of memory locations, 2n, as the extended matrix 
approach. Moreover, they both need the additional property that s be injective, 
as an Sk — s# denominator appears in These facts may lead one to suspect 
that they arc indeed the same algorithm. However, it is to be noted the two 
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Algorithm 4 Solving a system Cx = b with the downdating algorithm 



Require: G 6 C" xr , B G C rx '\ i, s £ C" {generators of the matrix} 
Require: b £ C nxm {right-hand side} 

{temporary variables: I, u G C n } 

.x < — b 

for fc = 1 to n — 1 do 

k <- 'V ; ' ' for all i = k to n 

1 tl-s k 

q <— argmax^_ fc n |^| {Finds pivot position} 

p <- {pivot} 
if p=0 then 

print 'error: singular matrix' 
end if 

swap ifc and ? g ; and x q ^ : ; Gk,-. and G 9i: ; tk and 
"A; <— P 

^_ G k ,B.„ e f a ll £ = fc + 1 to n 

<— Xf. : — p hxk,-. for all ^ = k + 1 to n 
G it ; <- G f . : - p^kGk,: for all £ = & + 1 to n 
£? : ^ <— _B : .£ — p^B.^kUg for all ^ = A; + 1 to n 
end for 



%n,-. <— x n: :/u n {start of the back-substitution step} 
for k = n — 1 down to 1 do 

£? : ^ <— B- j + u^B-^ue for all £ = fc + 1 to n 
%k,-. *~ a^fe,: — U£X£ t : for £ = k + 1 to n 

Xk,: <- X k: ./u k 

end for 
return x 



algorithms notably differ in the way in which the system Ux = y is solved: 
in the extended matrix approach we solve this system by accumulating the 
explicit multiplication U~ 1 y, while in the downdating approach we solve it by 
back-substitution. 

Several small favorable details suggest adopting the latter algorithm: 

• With the extended matrix approach, we do not get any entry of x before 
the last step. On the other hand, with the downdating approach, as soon 
as the first for cycle is completed, wc get x n , and then after one step 
of the downdating part wc get x n -i, and so on, getting one new compo- 
nent of the solution at each step. This is useful because in the typical 
use of this algorithm on Toeplitz matrices, x is the Fourier transform of 
a "meaningful" vector, such as one representing a signal, or an image, or 
the solution to an equation. Using the correct ordering, the last entries 
of a Fourier transform can be used to reconstruct a lower-sampled pre- 
view of the original data, with no additional computational overhead, see 
e.g. Walker [21], Thus with this approach we can provide an approximate 
solution after only the first part of the algorithm is completed. 

• In the extended matrix version, each step of the algorithm updates 0(nr) 
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memory locations. Instead, in the downdating version, for each k, the 
(n — fc)th and (n + fc)th step work on 0(kr) memory locations. Therefore, 
the "innermost" iterations take only a small amount of memory and thus 
fit better into the processor cache. This is a desirable behavior similar to 
the one of cache- oblivious algorithms [7]. 

• In exact arithmetic, at the end of the algorithm the second generator B of 
the matrix C is reconstructed as it was before the algorithm. In floating 
point arithmetic, this can be used as an a posteriori accuracy test: if one 
or more entries of the final values of B are not close to their initial value, 
then there was a noticeable algorithmic error. 

6 Computations with Trummer-like matrices 

A special class of Cauchy-like matrices which may arise in application [51fT51[51H] 
is that of Trummer-like matrices, i.e., those for which the two node vectors co- 
incide (t = s) and are injectivc. The partial rcconstructibility of these matrices 
requires special care to be taken in the implementation of the solution algo- 
rithms. The 0(n)-storagc algorithms we have presented can be adapted to deal 
with this case. Moreover, it is a natural request to ask for an algorithm that 
computes the generators of T _1 given those of T. Such an algorithm involves 
three different parts: the solution of a linear system with matrix T and multi- 
ple right-hand side; the solution of a similar system with matrix T* , and the 
computation of diag(T _1 ). We will show that an adaptation of the algorithm 
presented in lscction 4l can perform all three at the same time, fully exploiting the 
fact that these three computations share a large part of the operations involved. 

Theoretical results The following results, which are readily proved by ex- 
panding the definition of V s on both sides, are simply the adaptation of classical 
results on displacement ranks (see e.g. Heinig and Rost |12j ) to the Trummer- 
like case. Notice the formal similarity with the derivative operator. 

Theorem 4. Let A,B G C nxi \ and let r(X) = rkV s (A). 

1. V S (A + B)= V S (A) + V S {B), so r(A + B) < r(A) + r(B). 

2. V S (AB) = W S (A)B + A W S (B), so r{AB) < r(A) + r(B). 

3. V^A- 1 ) = -A- 1 V S {A)A~\ so r^" 1 ) = r(A). 

As we saw in Isection 2l a Trummer-like matrix can be completely recon- 
structed by knowing only the node vector s, the generators G and B, and its 
diagonal d = diag(T). In this section, we are interested in implementing fast — 
i.e., using 0(n 2 ) ops — and space-efficient — i.e., using 0(n) memory locations — 
matrix-vector and matrix-matrix operations involving Trummer-like matrices 
stored in this form. 

Matrix- vector product For the matrix- vector product, all we have to do 
is reconstructing one row at a time of the matrix T and then computing the 
customary matrix- vector product via the usual formula (Tv)i = J2j ^ij v j- Ap - 
proximate algorithms for the computation of the Trummer-like matrix-vector 
product with 0(n log 2 n) ops also exist, see e.g. Bini and Pan [2]. 
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Matrix-matrix operations The matrix product between two Trummer-like 
matrices T and S is easy to implement: let Gt and Bt (resp. Gs and Bs) be 
the generators of T (resp. S); then, bv lThcorcm 41 the generators of TS are 



[TG S G T ], 



B s 
BxS 



while diag(T5) can be computed in 0(n 2 ) by recovering at each step one row 
of T and one column of S and computing their dot product. Sums are similar: 
the generators of S + T are 



[Gs Gt] 



B s 
Bt 



and its diagonal is ds + (It ■ 



Linear systems Linear system solving is less obvious. Kailath and Olshevsky 
[T5] suggested the following algorithm: the GKO Gaussian elimination is per- 
formed, but at the same time the computed row Uf~,k-.n and column Lk :n .k are 
used to update the diagonal d to the diagonal of the Schur complement, accord- 
ing to the customary Gaussian elimination formula 

ijr^^-^Sr 1 ^ (6) 

It is easy to see that this strategy can be adapted to both the extended matrix 
and the downdating version of the algorithm, thus allowing one to implement 
GKO with 0{n) storage also for this class of matrices. 

However, a more delicate issue is pivoting. Kailath and Olshevsky do not 
deal with the general case, since they work with symmetric matrices and with 
a symmetric kind of pivoting that preserves the diagonal or off-diagonal posi- 
tion of the entries. Let us consider the pivoting operation before the fcth step 
of Gaussian elimination, which consists in choosing an appropriate row q and 
exchanging the fcth and qtii rows. The main issue here is that the two non- 
rcconstructible entries that were in position T^k and T qq , now are in positions 
T q k and Tk q . This requires special handling in the construction of the fcth row in 
the Gaussian elimination step, but luckily it does not affect the successive steps 
of the algorithm, since the fcth column and row are not used from step k + 1 on- 
wards. On the other hand, the entry T qq , which used to be non-reconstructible 
before pivoting, is now reconstructible. We may simply ignore this fact, store 
it in d and update it with the formula ([6]) as if it were not reconstructible. The 
algorithm is given here as [Algorithm 5| 

A similar modification is also possible with the extended matrix version of 
the algorithm — we shall see it in more detail in the next paragraph. 



Matrix inversion Matrix inversion poses an interesting problem too. The 
generators of T _1 can be easily computed as T~ X G and —BT^ 1 by resorting to 
Lemma [5] applied twice on T and T*. However, whether we try to compute the 
representation of T _1 or directly that of T _1 S for another Trummer-like matrix 
S, we are faced with the problem of computing diag(T _1 ) given a representation 
of T. There appears to be no simple direct algorithm to extract it in time 0(n 2 ) 
from the LU factors of T. 
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A possible solution could be based on the decomposition T -1 = diag(/)+F, 
with / a vector and F a matrix with diag(F) = [0,0,..., 0] T . In fact, notice 
that F depends only on the generators of the inverse; therefore, after computing 
them, one could choose any vector v for which T -1 t; has already been computed 
(e.g., and solve for the entries of / in the equation T~ 1 v = diag(/)w + Fv. 
This solution was attempted in [3] , but was found to have unsatisfying numerical 
properties. 

We shall present here a different solution based on the observations of lThcorcm 21 
that will allow us to compute the diagonal together with the inversion algorithm. 
Let us ignore pivoting in this first stage of the discussion. Notice that the last 
part of lThcorcm 21 shows us a way to compute (U~ )v.k,k at the fcth step of the 
extended matrix algorithm. Our plan is to find a similar way to get (L~ 1 )k,v.k 
at the same step, so that we can compute the sums 

(T- 1 hi = ^2(U- 1 )i !k (L- 1 ) kii , i = l,...,n, (7) 
k 

one summand at each step, accumulating the result in a temporary vector. 
The following result holds. 

Lemma 5. Let T be Trummer-like with generators G and B, nodes s and 
diagonal d, T — LU be its LU factorization, and D — diag(p), where pi — Ui t i 
are the pivots. 

1. The LU factorization ofT*, the transpose conjugate ofT, is (U* D~ l )(DL*). 

2. The matrix T* is Trummer-like with nodes s, diagonal d and generators 
B* and G*. 

3. Let G^ k > and B^ be the content of the variables G and B after the kth 
step of the GKO algorithm on T , and Gr- k ' and B^ ' be the content of the 
same variables after the same step of the GKO algorithm on T* . Then. 
QW = (B^Y and B^ = (G^)* 

Proof. The matrices U*D~ 1 and DL* are respectively unit lower triangular 
and upper triangular. Thus the first part holds by the uniqueness of the LU 
factorization. The second part is clear, and the last one follows by writing down 
the formula © for T and T* . □ 

Therefore, there is much in common between the GKO algorithm on T and 
T* , and the two can be carried on simultaneously saving a great part of the 
computations involved. Moreover, in the same way as we obtain (U~ l )i : k_k, 
we may also get at the fcth step its equivalent for T*, i.e., ((DL*)^ 1 )i-^ = 
Pk(L~ 1 )k ! i:k- Since p k , the fcth pivot, is also known, this allows to recover 

(^ _1 )A:4:fc- 

Thus we have shown a way to recover both (U -1 )^^ and (L~ 1 )k : \-k at the 
fcth step of the extended matrix algorithm, and this allows to compute the fcth 
summand of ([7]) for each i. 

Pivoting How does pivoting affect this scheme for the computation of diag(T -1 )? 
If T = PLU, formula $7$ becomes 

{T-\ i = J2(U- 1 )i,k(L- 1 P- 1 )k, i , i = l,..., n. (8) 

k 
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The permutation matrix P, of which we already have to keep track during the 
algorithm, acts on L^ 1 by scrambling the column indices i, so this does not 
affect our ability to reconstruct the diagonal, as we still have all the entries 
needed to compute the fcth summand at each step k. We only need to take care 
of the order in which the elements of {U~ l )\ : k,k and (£ -1 )fc,i : fc are paired in 



The complete algorithm, which includes pivoting, is reported here as Algorithm 6 



Comments It is worth noting with the same run of the GKO algorithm we 
can compute diag(T ,_1 ) and solve linear systems with matrices T and T* , as the 
two algorithms share many of their computations. In particular, the solutions of 
the two systems giving T _1 G and BT~ X , which are the generators of T -1 , are 
computed by the algorithm with no additional effort: the transformations on 
G and B needed to solve them are exactly the ones that are already performed 
by the factorization algorithm. Also observe that, since the computation of 
(T _1 )i.i spans steps i to n, while di is needed from step 1 to step i, we may 
reuse the vector d to store the diagonal of the inverse. The resulting algorithm 
has a total computational cost of (8r + 2rai + 2to2 + h)n 2 ops, if we solve at 
the same time a system Tx = b with b £ C" xmi and a system yT = c with 
c G c™2xn (otherwise just set mi = and/or = 0). The only extra storage 
space needed is that used for u, I and a, i.e., the space required to store 2n real 
numbers and n integer indices. 

Another observation is that we did not actually make use of the fact that 
the diagonal of T is non-reconstructible: in principle, this approach works even 
if C is a Cauchy matrix with respect to two different node vectors t and s. This 
might be useful in cases in which we would rather not compute explicitly the 
diagonal elements, e.g. because U — S{ is very small, and thus would lead to 
ill-conditioning. 

7 Numerical experiments 

Speed measurements The speed experiments were performed on a FOR- 
TRAN 90 implementation of the proposed algorithms. The compiler used was 
the If 95 FORTRAN compiler version 6.20c, with command-line options -o2 
-tp4 -lblasmtp4. The experiments took place on two different computers: 

CI a machine equipped with four Intel® Xeon™ 2.80Ghz CPUs, each equipped 
with 512kb of L2 cache, and 6 GB of RAM. Since we did not develop a 
parallel implementation, only one of the processors was actually used for 
the computations. 

C2 a machine equipped with one Intel® Pentium® 4 2.80Ghz CPU with 
1024kb of L2 cache, and 512Mb of RAM. 

As an indicative comparison with a completely different algorithm, with 
different stability characteristics, we also reported the computational time for 
the solution of a (different!) Toeplitz system of the same size with the classical 
TOMS729 routine from Chan and Hansen, which is a Levinson-based Toeplitz 
solver. For CI, we reported the times of the Matlab backslash solver as a further 
comparison. 

The results are shown in I Table Tl 
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CI 



C2 



n 


0(n 2 )-space GKO 


Extended matrix 


Downdating 


TOMS729 


Backslash 


128 


1.4868c-03 


2.0160c-03 


2.1330e-03 


9.3602c-04 


3.4408e-03 


256 


1-7 r-7 f\ A(\ AO 

7.7040c-03 


7.6757c-03 


O 1 *7 A A l \ 1 

8.1744e-03 


3.5895c-03 


1.6893e-02 


512 


6.0557c-02 


2.9159e-02 


3.0104e-02 


1.3818e-02 


8.4146e-02 


1024 


2.4182e-01 


1.1922e-01 


1.2351e-01 


5.4675e-02 


4.4138e-01 


2048 


9.6240e-01 


4.5570e-01 


4.6561e-01 


2.5636e-01 


2.6087e+00 


4096 


4.6003c+00 


1.8935c+00 


1.9060e+00 


1.0242c+00 


1.6023c+01 


8192 


3.0398c+01 


9.2616c+00 


8.1778e+00 


5.4107c+00 


Out of memory 


16384 


Out of memory 


4.0561e+01 


3.6369e+01 


2.3309c+01 


Out of memory 


32768 


Out of memory 


1.9054e+02 


1.6358e+02 


1.0447c+02 


Out of memory 


65536 


Out of memory 


7.9284c+02 


7.0361e+02 


4.l7l7e+02 


Out of memory 


n 


0(?i 2 )-space GKO 


Extended matrix 


Downdating 


TOMS729 




128 


1.5779e-03 


2.1675e-03 


2.2298e-03 


4.4989e-04 




256 


6.1923e-03 


8.1009e-03 


8.1240e-03 


l.5978e-03 




512 


6.3149e-02 


3.1964e-02 


3.1461e-02 


5.98l2e-03 




1024 


3.0802c-01 


1.2740c-01 


1.2386e-01 


2.3439e-02 




2048 


1.2481c+00 


5.1163e-01 


4.9413e-01 


9.7404c-02 




4096 


4.8710c+00 


2.0433c+00 


1.9761e+00 


3.8726e-0l 




8192 


Out of memory 


8.4085c+00 


8.0465e+00 


l.5995c+00 




16384 


Out of memory 


3.7214e+01 


3.3758e+01 


7.3925c+00 




32768 


Out of memory 


1.9180c+02 


1.5883e+02 


4.83l5c+0l 




65536 


Out of memory 


9.0789e+02 


7.9906e+02 


2.6792c+02 





Table 1: Speed experiments: CPU times in seconds for the solution of Cauchy- 
likc/Tocplitz linear systems with the different algorithms 



Comments It is clear from the table that two different behaviors arise for 
different sizes of the input. For small values of n, the winner among the GKO 
variants is the traditional 0(n 2 )-space algorithm, due to its lower computational 
cost of (4r + 2m + l)n 2 instead of (6r + 2m + |)n 2 (for these tests, r = 2, 
m =1). As the dimension of the problem increases, cache efficiency starts 
to matter, and the traditional algorithm becomes slower than its counterparts. 
This happens starting from n w 256 — 512. Quick calculations show that the 
memory occupation of the full n x n matrix is 512kb for n — 256 and 2Mb for 
n = 512, so the transition takes indeed place when the 0(n 2 ) algorithm starts 
to suffer from cache misses. 

The three GKO variants are slower than TOMS729; this is also due to the 
fact that the implementation of the latter is more mature than the GKO solvers 
we developed for this test; it uses internally several low- level optimizations such 
as specialized BLAS routines with loop unrolling. 

Accuracy measurements For the accuracy experiments, we chose four test 
problems, the first two inspired from Boros, Kailath and Olshevsky [5], the third 
taken from Gohberg, Kailath and Olshevsky [3], and the fourth from Sweet [T3] 
(with a slight modification). 

PI is a Cauchy-like matrix with r — 2, nodes U = a + ib, Sj = jb for a = 1 
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and 6 = 2, and generators G and B such that Gi.i = 1, G;^ = — 1, 
Bi j = (— iy , i?2j = 2. It is an example of a well-conditioned Cauchy- 
like matrix; in fact, for n = 512, its condition number (estimated with the 
Matlab® function condest) is 4E+02, and for n = 4096 it is 1.3E+03. 

P2 is the same matrix but with a = 1 and 6 = —0.3. It is an ill-conditioned 
Cauchy like matrix; in fact, the condition number estimate is 1E+17 for 
n = 512 and 5E+20 for n = 4096. 

P3 is the Gaussian Tocplitz matrix [5], i.e., the Toeplitz matrix defined by 
Ti_j = a' 1- " , with size n = 512 and different choices of the parameter 
a G (0, 1). It is an interesting test case, since it is a matrix for which the 
Levinson-based Toeplitz solvers are unstable [Hj- Its condition number 
estimate is 7E+09 for a = .90 and 3E+14 for a = 0.93. 

P4 is a Cauchy-like matrix for which generator growth is expected to occur 
[15] . We chose n = 128, the same nodes as PI, and generators defined 
by G = [a, a + ef], B = [a + eg, — a] T , where e = 10~ 12 and a,f,g are 
three vectors with random entries uniformly distributed between and 
1 (generated with the Matlab rand function). Notice that the absolute 
values of the entries of GB is about le-12, and their relative accuracy is 
about le-04. 

For PI and P2, we chose several different values of n, for each of them we 
computed the product v = Ce (where e = [11 . . . 1] T ) with the corresponding- 
matrix C, and applied the old and new GKO algorithms to solve the system 
Cx = v. We computed the relative error as 



As a comparison, for P2 we also reported the accuracy of Matlab's unstructured 
solver (backslash), which is an 0(n 3 ) algorithm based on Gaussian elimination. 

For P3, we solved the problem Tx — v, with T the Gaussian Tocplitz matrix 
and v = Te, for different values of the parameter a, with several different meth- 
ods: reduction to Cauchy-like form followed by one of the three GKO-Cauchy 
solvers presented in this papers, Matlab's backslash, and the classical Lcvinson 
Tocplitz solver TOMS729 by Chan and Hansen [11] . The errors reported in the 
table are computed using the formula @. 

For P4, we generated five matrices, with the same size and parameters but 
different choices of the random vectors a and /. We used the same right-hand 
side and error formula as in the experiments PI and P2. The condition number 
estimates of the matrices are reported as well. 

The results are shown in lTable"2l 

Comments There are no significant differences in the accuracy of the three 
variants of GKO. This shows that, at least in our examples, despite the larger 
number of operations needed, the space-efficient algorithms are as stable as the 
original GKO algorithm. On nearly all examples, the stability is on par with 
that of Matlab's backslash operator. When applied to critical Toeplitz problems, 
the GKO-based algorithms can achieve better stability results than the classical 
Levinson solver TOMS729. 
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In the generator growth case P4, the accuracy is very low, as expected from 
the theoretical bounds; nevertheless, there is no significant difference in the 
accuracy of the three versions. 

We point out that a formal stability proof of the GKO algorithm cannot be 
established, since it is ultimately based on Gaussian elimination with partial 
pivoting, for which counterexamples to stability exist, and since in some limit 
cases there are other issues such as generators growth [T5]: i.e., the growth of 
the elements of G and B (but not of L and U) along the algorithm. However, 
both computational practice and theoretical analysis suggest that the GKO 
algorithm is in practice a reliable algorithm [IT]- Several strategics, such as 
the one proposed by Gu |10j . exist in order to avoid generator growth, and 
they can be applied to both the original GKO algorithm and its space-efficient 
versions. The modified versions of GKO are not exempt from this stability 
problem, since they perform the same operations as the original one plus some 
others; nevertheless, when performing the numerical experiments for the present 
paper, we encountered no case in which the 0(n)-space algorithms suffer from 
generator growth while the original version does not. 

A posteriori accuracy test We tested on the experiment P2 the a pos- 
teriori accuracy test mentioned at the end of Isection 51 i.e., solving the sys- 
tem with the downdating approach and then comparing the values of B before 
and after the algorithm. In ITablc 31 we report the value of the relative error 
|| S — B'\\ j ||-B||, where B' is the value of the variable initially holding the sec- 
ond generator B at the end of the algorithm. We compare it with the relative 
residual \\Cx — b\\ / \\b\\, where C and b are the system matrix and right-hand 
side of the experiment P2, and x is the solution computed by the downdating 
algorithm. At least in this experiment, the proposed test is not able to capture 
the instability of the algorithm; the computation of the relative residual is more 
accurate as an a posteriori test to estimate the accuracy of the solution. 

Speed comparison with Matlab's backslash We have compared the speed 
of the Matlab and Fortran implementations of downdating GKO with the cost 
of assembling the full matrix C in Matlab and solving the system with the 
backslash operator. The results are in lTablc "il The experiments were performed 
on CI, and the version of Matlab used was 7 (R14) SP1. 

The comparison is not meant to be fair: on one hand, we are testing a 0(n 2 ) 
and a 0(n 3 ) algorithm; on the other, we are comparing an interpreted program, 
a compiled program and a call to a native machine-code library within Matlab. 
Starting from n = 2048, even the Matlab version of the downdating algorithm 
is faster than backslash: for large values of n, the overhead of processing 0(n) 
instructions with the Matlab interpreter is amortized. 

Inversion of Trummer-like matrices We shall now turn to testing the 
algorithm for the structured inversion of Trummer-like matrices proposed in 
Isection 61 We chose two experiments, one with well-conditioned matrices and 
one with ill-conditioned ones. Notice that not all possible choices of the genera- 
tors G and B are admissible for a Trummer-like matrix, since the displacement 
equation implies Gi^B-^ = for all i. 
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Tl The n x n Trummer matrix T with = 1, nodes defined by Si = i/n 
and generators defined by G^i = i, G^% = — 1, Bi j = cos(7rz/n), Bj 2 = 
Gi^Bi^ for all i = 1, . . . , 7i. 

T2 The 512 x 512 diagonal-plus-rank- 1 matrix depending on a parameter e 
and defined by T = (1 + e)/ — im T , with u = vj ||u|| and = i/n for 
all i = 1, . . . , n. Its inverse can be computed explicitly as (1 + e)~ x (l + 
e _1 n« T ), and its condition number is e -1 + 1. A diagonal-plus-rank- 1 
matrix is Trammer-like with r — 2 with respect to any set of nodes; in 
this experiment, we used the node vector defined by Sj = a + ib for all 
i = 1, ... ,n, with a = 1, b = —0.3. 

We computed the relative errors 

= \\d'-d"\\ = \\G'-G"\\ \\B'-B"\\ = \\T'-T"\\ 
1 IM"II ' 2 l|G"|| ||B"|| ' 3 ||T"|| ' 

where G',B',d',T' are the generators, diagonal and full inverse computed by 
[Algorithm 6| and G", B", d", T" are their reference values computed with Mat- 
lab's function inv for Tl and with the exact formula for the inverse for T2. 
The results are reported in ITablc 51 In both cases, the algorithm is able to 
reach good accuracy, compatibly with the restrictions imposed by the condition 
number of the matrices. 

Code availability Fortran and Matlab® implementations of the algorithms 
presented here are available online on |http : / / arxiv . org/e-print/0903 . 4569 
along with the e-print of this paper. 

8 Conclusions 

In this paper, we proposed a new 0(n)-space version of the GKO algorithm for 
the solution of Cauchy-likc linear systems. Despite the slightly larger number of 
operations needed, this algorithm succeeds in making a better use of the internal 
cache memory of the processor, thus providing an improvement with respect 
to both the customary GKO algorithm and a similar 0(n)-space algorithm 
proposed by Rodriguez [18], pQ. Starting from n ~ 500 — 1000, the algorithm 
outperforms these two versions of GKO. When applying this algorithm to the 
special case of inversion of Trummer-like matrices, several small optimizations 
reduce the total number of operations needed. 
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Algorithm 5 Solving a system Tx = b with the downdating algorithm 

Require: G G C" xr , B G C rxn , s G C™ {generators of the matrix T} 
Require: d G C" {diagonal of T} 
Require: b G C" xm {right-hand side} 
{temporary variables: l,u G C n } 

{temporary variable: cr G N™ vector of integer indices used to keep track of 

the permutation performed during the pivoting} 

cr(i) <— i for alH = 1 to n {initializes a as the identity permutation} 

for k = 1 to n — 1 do 

j ^ G ^.: g ^ for all i = k + i t0 n 

S<7(t)-Sk 

q <— argmax^_ fcfc+1 n |^| {Finds pivot position} 

P*—lq {piVOt} 

if p=0 then 

print 'error: singular matrix' 
end if 

swap Ik and l q ; Xk,\ and x q ^ Gk,-. and G q . : \ cr(k) and <r{q) 
u k <- p 

^_ g^b^ for all ^ = fc + 1 to n, ^ 7^ o 
u q <— cZg {non-reconstructible entry that moved off-diagonal after pivoting} 

<— %£,-. ~ P~ 1 h%k,-. for all I = k + 1 to n 
g\- <- - v' x kGk- for all i = jfc + 1 to ?i 
-8:,^ -B : / — p~ 1 B^kU£ for all i = k + 1 to n 

di <— di — p~ 1 liui for all I = k + 1 to n{Gaussian elimination on the 
diagonal} 

if g k then {d q may be reconstructible after the pivoting — but we store 

it explicitly anyway} 

^ Gq,.B :iq 

end if 
end for 

G„ B. „ 

W„ ■ 



#n,: x nt ;/u n {start of the back-substitution step} 
for k = n — 1 down to 1 do 

G fc ,B : ,, f all £ = + 1 to n 

Sk-se 

B-^i <— £> : | + u^B.^ui for all £ = fc + 1 to n 
%k,-. *— %k,: — ugX£^ : for £ — k + 1 to n 

Xk,: <- Xk,:/U k 

end for 
return x 
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Algorithm 6 Computing the representation of T 1 (and solving systems with 
matrix T and T*) 

Require: G i C" xr , B i C rx ™, seC" {generators of the matrix T} 
Require: d£C" {diagonal of T} 

Require: b £ C nXmi ,c € C m2X " {for the (optional) solution of Tx = b and 
yT = c} 

{temporary variables: I, u € C",cr G N"} 
x <— 6; ?/ <— c 

er(i) <— i for all i = 1 to n {initializes a as the identity permutation} 
for k — 1 to n — 1 do 

j Gt,:B;,k f aU £ = i to - 1 

; <_ f or all £ = fc + 1 to n 

q <— argmax^_ fe fe+1 n {Finds pivot position} 

P*—lq {piVOt} 

if p=0 then 

print 'error: singular matrix' 
end if 

swap Ik and l q ; Xk.-. and x qi -; Gk } -. and G q y, cr(fc) and tr(g) 

U£ <— Gfc .-^ : ><' f or all £ = fc + 1 to 7i, £ 7^ g { "extended matrix" computa- 

tions for T*} 




^ for all £ = k + 1 to 77, £ ^ o 



u g <— (iq {non-reconstructible entry that moved off-diagonal after pivoting} 

^fc,: P~ 1 Xk,:', Xi,: ^~ Xl <: ~ lg.Xk,; for all I ^ k 

Gk, «- V 1 ( •>..:■ Gi, «- G ( , - kG k , for all £ ? k 

B- t k P _1 -B:,fc; -B;,^ — B.^ui for all £ ^ A; {the update is performed 

in the "extended matrix" fashion for B and y too} 

2/ : ,fc <- P~ 1 V-.,k\ V:,t ^ V:,t - V-.,kU£ for all £ k 
di *— dz — p~ 1 lgue for all £ = k + 1 to n 

if q ^ k then {d q may be reconstructible after the pivoting — but we store 
it explicitly anyway} 



end if 

Zfc < 1; Tifc < 1; dk <— 0{prepares to overwrite d(k) with the diagonal of 

the inverse} 

ug <— for £ — k + 1 to n {permutes the entries of u to create the right 
matching for the update of the formula©. Notice that the variable u holds 
now the entries of (i~ 1 P~ 1 ) / t.£ and I holds the entries of (C/ -1 )^.^} 
u a(t) *~ u e fo r all ^ = 1 to 77, 
di *— di + p~ 1 liui for all £ = 1 to k 
end for {continues in the next page} 
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Algorithm 6 (continued) Computing the representation of T 1 (and solving 
systems with matrix T and T*) 

{continues: last step (k = n) of the algorithm} 

l t <_ G e ,-B : , n for a u £ = i to n - 1 
< G n] -B.. t i all f = 1 to n — 1 

X n ,: <- P~ l X n ,:] %t,: ^~ — h%n,: f° r all •£ = 1 to 71 — 1 

G„ i: <- p -1 Gn, : ; G*, : <- G^, : - ZiG„, : for aU ^ = 1 to n - 1 
-B:,n *~ P^B- n, B-^ <— B. ( — B-^ut for all i = 1 to n — 1 

<- P^V-.-n, V-.,i <- - J/ : ,n«i for alH = 1 to 71 - 1 
Z n < 1 , ii n * 1 , rf n < 
u <r(i) *~ M f fo r all £ = 1 to n 
d,£ <— c?£ + p~ 1 l(U( for all ^ = 1 to n 

U:,a(i) = f° r all ^ = 1 to n {undoes the pivoting on the rows of y and £?} 

= -^M f° r all ^ = 1 to n 
return G, B, s, d {generators of the inverse} 
return x, y {solutions of Tx = b and yT = c] 
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n 


0(n 2 )-spacc GKO 


Extended matrix 


Downdating 










128 


1.262497e-15 


1.160020e-15 


1.062489e-15 








256 


1.520695c-15 


1.812447e-15 


1.463218e-15 








512 


2.979162e-15 


3.063677e-15 


3.091645e-15 








1024 


2.790466e-15 


3.429299e-15 


3.068041e-15 






PI 


2048 


4.568803e-15 


5.921849e-15 


5.044874e-15 








4096 


5.231503e-15 


7.448194e-15 


5.461259e-15 








8192 


7.491095e-15 


1.250913e-14 


7.287788e-15 








16384 


Out of memory 


1 648221 e-1 4 


1.154215c-14 








32768 


Out of memory 


9 fi949fifip 1 A 


1.757211c-14 








65536 


Out of memory 


"X QOQ'i'iQo 1 A 
o . yzyooyt 


2.209921e-14 








n 


0(n 2 )-space GKO 


Extended matrix 


Downdating 




Backslash 






128 


4.226744e-05 


4.226745e-05 


4.226745e-05 




6.943135e-05 






256 


2.498321c-03 


2.498321e-03 


2.498321e-03 




1.681902c-03 




P2 


512 


1.307574c-01 


1.307574e-01 


1.307574e-01 




2.151257c-01 






1024 


1.634538c+01 


1.634538c+01 


1.634538c+01 


1.872503c+01 




2048 


4.367616c+02 


4.367616e+02 


4.367616c+02 


2.341069c+03 




4096 


2.311074e+04 


2.311075c+04 


2.311075e+04 


1.124867c+03 




a 


<3(n 2 )-space GKO 


Extended matrix 


Downdating 




Backslash 


TOMS729 




0.85 


2.916298e-10 


1.584459e-10 


1.960486e-10 




3.083869e-ll 


1.631254c-10 




0.87 


7.080698e-10 


6.933175e-10 


6.234554e-10 




3.672145e-10 


2.736550e-09 




0.90 


1.928754e-07 


2.741270e-07 


1.807345e-07 




1.402849e-07 


3.122989e-06 


P3 


0.91 


2.690618e-04 


1.645149e-04 


2.647343e-04 




1.359575e-06 


1.208559e-04 




0.92 


8.092059e-05 


1.165346e-04 


1.540948e-04 




4.638024e-05 


1.023501e-02 




0.93 


5.766805e-03 


6.569097e-03 


6.182359e-03 




2.532194e-03 


2.486232c+00 




0.94 


3-767035e-01 


2.111118e+00 


2.837602e-01 




1.116684e+00 


1.767069c+03 



P4 



0(n 2 )-spacc GKO 


Extended matrix 


Downdating 


condest (C) 


1.678593e-01 


1.678593e-01 


1.681295e-01 


4e+04 


1.170304e-01 


1.160042e-01 


1.125991e-01 


4e+03 


2.805922e+01 


2.800674e+01 


2.797860c+01 


lc+05 


5.552484e-02 


5.173933e-02 


5.568090e-02 


lc+04 


6.540661e-02 


6.760469e-02 


7.393447e-02 


lc+03 



Table 2: Relative forward errors 



n 


A posteriori test 


Relative residual 


128 


2.6678472c-12 


2.0294501c-13 


256 


5.6104849e-12 


4.2803993e-13 


512 


8.8302484c-12 


1.6978596e-12 


1024 


1.4490236e-10 


2.9394916e-09 


2048 


6.5423206e-10 


4.9881189e-07 


4096 


6.2470657e-10 


3.5584063e-05 



Table 3: Accuracy of the a posteriori accuracy test 
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n 


Downdating(Matlab) 


Downdating(Fortran) 


Backslash(asscmbling+solving) 


128 


4.831C-02 


2.046e-03 


1.446e-02 - 


1- 2.703e-03 


256 


9.224c-02 


7.519e-03 


1.945e-02 - 


1- 1.237c-02 


512 


2.125e-01 


2.783e-02 


4.016e-02 - 


1- 6.624e-02 


1024 


5.809e-01 


1.114c-01 


1.224c-01 - 


1- 3.527e-01 


2048 


1.849c+00 


4.329e-01 


4.518e-01 4 


- 2.215e+00 


4096 


7.099c+00 


1.732e+00 


1.910e+00 - 


1- 1.426e+01 



Table 4: CPU times (in seconds) on the machine CI for the Matlab and Fortran 
implementation of downdating and for the Matlab backslash operator 



T2 



n 


Eh. 


Ei 


E 3 


condest (T) 


128 


5.4547208e-16 


1.1319021e-14 


2.9390472e-15 


5. 8720426c- 


-02 


256 


7.0728226e-16 


2.8467666e-14 


7.4006119e-15 


1.1929568e^ 


-03 


512 


9.3436757e-16 


4.1000919e-14 


1.1661137e-14 


2.4043890en 


-03 


1024 


1.3348516e-15 


1.2770834e-13 


2.9260324e-14 


4.8272132e4 


-03 


2048 


1.8731364e-15 


2.3347806e-13 


5.6916515e-14 


9.6728404eH 


-03 


4096 


2.7481278e-15 


2.5345941e-13 


7.9356066e-14 


1.9364084eH 


-04 


e 




E 2 


E 3 






le-03 


2.2655145e-ll 


5.9001177e-ll 


3.0152973c-ll 




le-06 


4.0447578e-08 


8.0919137e-08 


4.1084327e-08 




le-09 


4.0899169e-05 


8.1796690e-05 


4.1263900e-05 




le-12 


3.2571481e-02 


6.6239581e-02 


3.2914231e-02 




le-15 


1.4160667e+00 


4.8195274c+00 


1.7288419e+00 





Table 5: Accuracy of the algorithm for inverting Trummer-like matrices 
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